Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings

Authors

Blaise Hanczar

Jianping Hua

Edward R. Dougherty

Abstract

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated. Natural questions therefore arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three commonly used error estimators (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features. Which of the latter two yields the better correlation shows no general trend and differs from model to model.
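The abstract refers to a decomposition of the variance of the deviation distribution but does not write it out. A minimal sketch, assuming it is the standard identity Var(est - true) = Var(est) + Var(true) - 2 * rho * sd(est) * sd(true), is given below. The error values are synthetic Gaussian draws chosen only for illustration (they are not data from the paper), and the function names `simulate_errors` and `decompose_deviation_variance` are hypothetical.

```python
import math
import random


def simulate_errors(rho, n=5000, mean=0.2, sd_true=0.03, sd_est=0.06, seed=0):
    """Draw synthetic (true error, estimated error) pairs with target correlation rho.

    Assumption: errors are modelled as correlated Gaussians purely for illustration.
    """
    rng = random.Random(seed)
    true_err, est_err = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        true_err.append(mean + sd_true * z1)
        # Mix z1 and z2 so that corr(true, est) is approximately rho.
        est_err.append(mean + sd_est * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))
    return true_err, est_err


def decompose_deviation_variance(true_err, est_err):
    """Return the variance of (est - true) computed directly and via the identity."""
    n = len(true_err)
    mean_t = sum(true_err) / n
    mean_e = sum(est_err) / n
    var_t = sum((t - mean_t) ** 2 for t in true_err) / n
    var_e = sum((e - mean_e) ** 2 for e in est_err) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_err, est_err)) / n
    rho = cov / math.sqrt(var_t * var_e)

    # Direct computation on the deviation (estimated error minus true error).
    dev = [e - t for t, e in zip(true_err, est_err)]
    mean_d = sum(dev) / n
    var_dev = sum((d - mean_d) ** 2 for d in dev) / n

    # Same quantity rebuilt from the three components of the identity.
    var_from_parts = var_e + var_t - 2.0 * rho * math.sqrt(var_e * var_t)
    return {"rho": rho, "var_est": var_e, "var_true": var_t,
            "var_dev": var_dev, "var_from_parts": var_from_parts}


# Same marginal variances in both cases; only the correlation differs.
high = decompose_deviation_variance(*simulate_errors(rho=0.9))
low = decompose_deviation_variance(*simulate_errors(rho=0.1))

for label, result in (("high correlation", high), ("low correlation", low)):
    print(label, {key: round(value, 6) for key, value in result.items()})
```

Under these assumptions the variance of the estimated error is held roughly fixed across the two runs, so any increase in the deviation variance comes from the drop in correlation, which is the kind of effect the abstract attributes to high dimensionality.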
